Iteration 2

Extracting features with very good coverage, such as Gene Ontology, produces extremely large database files, so a new version of the code is required to deal with this problem. The new code will store the ProteinPairParser objects as pickle files and the features will be indexed from these objects through a __getitem__ method, with each ProteinPairParser interacting only with its own database, if it has one.

Each ProteinPairParser will have its own generator function, which will either be created using the options handed to it or loaded from another pickle file. The default generator will act as the code currently does, by creating a database and then indexing that database to retrieve values. According to the Python documentation, if the parser opens the database at initialisation and is then pickled, the database will be closed and opened again at unpickling time; see the TextReader example in the pickle documentation.
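The TextReader-style pattern mentioned above can be sketched as below. This is a minimal illustration, not the real ProteinPairParser: the class name and `_open` stand-in are hypothetical, with a plain dict standing in for the shelve database.

```python
import pickle

class PicklableParser(object):
    '''Sketch of the pickling pattern described above (illustrative
    names, not the real ProteinPairParser API). The open database
    handle is dropped when pickling and reopened when unpickling,
    as in the TextReader example in the pickle documentation.'''
    def __init__(self, path):
        self.path = path
        self.db = self._open(path)

    def _open(self, path):
        # stands in for e.g. shelve.open(path)
        return {"path": path}

    def __getitem__(self, key):
        return self.db[key]

    def __getstate__(self):
        # copy the instance state, dropping the unpicklable handle
        state = self.__dict__.copy()
        del state["db"]
        return state

    def __setstate__(self, state):
        # restore the attributes, then reopen the database
        self.__dict__.update(state)
        self.db = self._open(self.path)
```

A pickled and unpickled copy then has a freshly opened database rather than a stale handle.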


In [8]:
cd /home/gavin/Documents/MRes/


/home/gavin/Documents/MRes

Going to step through the changes I'm making to the code and list them using git. After making changes I'll run a test case to make sure it's still working.


In [9]:
import sys
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")
import ocbio.extract

In [11]:
reload(ocbio.extract)


Out[11]:
<module 'ocbio.extract' from '/home/gavin/Documents/MRes/opencast-bio/ocbio/extract.py'>

Changing parsers to use getitem

This involves ensuring that at initialisation the parser will define a function to return values by default from the database it's going to create when regenerate is run. There must also be an option to load an arbitrary pickled object to return items.

Also, the databases must now be opened and closed with the parsers and the opened databases stored within the parser objects. The assembler will have to be modified to deal with this and close the databases when it's done.

Showing these changes:


In [7]:
!git show HEAD


commit 9aa9c8ae4e00146ca0b3fc6ac510fbcf67034e77
Author: Gavin <gavingray1729@gmail.com>
Date:   Sun Jun 22 10:37:09 2014 +0100

    debugged, works now

diff --git a/ocbio/extract.py b/ocbio/extract.py
index b1c37f6..0bb7db8 100644
--- a/ocbio/extract.py
+++ b/ocbio/extract.py
@@ -8,6 +8,19 @@ import shelve
 import re
 import sys
 
+def verbosecheck(verbose):
+    '''returns a function depending on the state of the verbose flag'''
+    if verbose:
+        def v_print(*args):
+            '''declare v_print function that prints to stdout
+            if verbose flag is on'''
+            for argument in args:
+                print argument,
+                print
+    else:
+        def v_print(*args):
+            None
+    return v_print
 
 class FeatureVectorAssembler():
     '''Assembles feature vectors from protein pair files, data source lists
@@ -18,16 +31,7 @@ class FeatureVectorAssembler():
         # store the directory of the table and it's name
         self.sourcetabdir, self.tabfile = os.path.split(sourcetab)
 
-        if verbose:
-            def v_print(*args):
-                '''declare v_print function that prints to stdout
-                if verbose flag is on'''
-                for argument in args:
-                    print argument,
-                    print
-        else:
-            def v_print(*args):
-                None
+        v_print = verbosecheck(verbose)
 
         v_print("Using {0} from top data directory {1}.".format(self.sourcetabdir,
                                                                 self.tabfile))
@@ -83,6 +87,7 @@ class FeatureVectorAssembler():
     def regenerate(self, force=False, verbose=False):
         '''Calls all known protein parsers and gets them to regenerate their
         output, if they have to.'''
+        v_print = verbosecheck(verbose)
         v_print("Regenerating parsers:")
         for parser in self.parserlist:
             v_print("\t parser {0}".format(self.parserlist.index(parser)))
@@ -94,6 +99,7 @@ class FeatureVectorAssembler():
         '''Assembles a file of feature vectors for each protein pair in a
         protein pair file supplied.
         Assumes the pairfile is tab delimited.'''
+        v_print = verbosecheck(verbose)
         v_print("Reading pairfile: {0}".format(pairfile))
         # first parse the pairfile into a list of frozensets
         pairs = map(lambda l: frozenset(l), csv.reader(open(pairfile), delimiter="\t"))
@@ -224,6 +230,7 @@ class ProteinPairParser():
     def regenerate(self, force=False, verbose=False):
         '''Regenerate the pair file from the data source
         if the data source is newer than the pair file'''
+        v_print = verbosecheck(verbose)
         # so first check the ages of both files
         datamtime = os.stat(self.datadir)[-2]
         if os.path.isfile(self.outdir):

Testing initialisation:


In [16]:
testparser = ocbio.extract.ProteinPairParser("none","none",generator="geneontology/retreival.function.pickle")

In [ ]:
import pickle
f = open("GOfeatures.class.test")
GOfeatures1 = pickle.load(f)

In [17]:
GOfeatures = None

In [12]:
ocbio.extract.ProteinPairParser?

Iteration 1

These are the notes on the development of the code described in the Feature vector assembly page of the wiki. First, the protein pair parser class will be written and tested on a dataset that has already been extracted. This will be the HIPPIE dataset.

Initialisation currently involves loading in the three command line options and saving them to the object. It must also involve parsing of the options. Testing the initialisation:


In [27]:
import os, time, subprocess, csv, shelve

In [1]:
cd /home/gavin/Documents/MRes/HIPPIE/


/home/gavin/Documents/MRes/HIPPIE

In [95]:
#define the class
class ProteinPairParser():
    '''Does simple parsing on data files to produce protein pair files with feature values'''
    def __init__(self,datadir,outdir,protindexes=(1,2),valindex=3,script=None,csvdelim="\t"):
        #first, store the initialisation
        self.datadir = datadir
        self.outdir = outdir
        self.protindexes=protindexes
        self.valindex=valindex
        self.script=script
        self.csvdelim=csvdelim
        return None
    
    def regenerate(self):
        '''Regenerate the pair file from the data source
        if the data source is newer than the pair file'''
        # so first check the ages of both files
        datamtime = os.stat(self.datadir)[-2]
        if os.path.isfile(self.outdir):
            pairmtime = os.stat(self.outdir)[-2]
        else:
            #bit of a hack
            pairmtime = 0
        #if the data modification time is greater than output modification time
        if datamtime > pairmtime:
            # now regenerate the data file according to the options defined above:
            print "data file is newer than pair file"
            #if there's a script to run
            if self.script != None:
                #then execute the script
                retcode=subprocess.call("python2 %s"%self.script, shell=True)
            #first delete out of date file, if it's there
            if os.path.isfile(self.outdir):
                os.remove(self.outdir)
            #perform simple parsing to make a file of just protein pairs and the value we care about
            #and save these using shelve
            db = openpairshelf(self.outdir)
            #open the data file
            c = csv.reader(open(self.datadir), delimiter=self.csvdelim)
            for line in c:
                #each line use the protein pair as a key
                #by formatting it as a frozenset
                pair = frozenset([line[self.protindexes[0]],line[self.protindexes[1]]])
                #and the value is indexed by valindex
                db[pair] = line[self.valindex]
            db.close()
        return None

We have to rerun the script to generate a pre-processed HIPPIE file:


In [82]:
%%bash
java -jar HIPPIE_NC.jar -i=../DIP/human/training.nolabel.positive.Entrez.txt -t=e -l=0 -o=prematch.positive.HIPPIE.txt

Then we can use the parser to quickly perform what was done previously in the notebook used to extract these features:


In [113]:
x = ProteinPairParser("prematch.positive.HIPPIE.txt","training.positive.HIPPIE.txt",protindexes=(1,3),valindex=4)

In [114]:
x.regenerate()


data file is newer than pair file

The function reports that the data file is newer than the pair file and regenerates the pair files as a persistent dictionary object. This is useful because it means that this can later be indexed quickly for building feature vectors.

We can open this database to show this functionality:


In [115]:
test = openpairshelf("training.positive.HIPPIE.txt")

If we look at the keys that this database uses we can see that it is using strings internally. It could be useful to modify the .keys() method of this class so that it produces a list of frozenset keys instead.


In [116]:
test.keys()[0:10]


Out[116]:
['10013\t55031',
 '10094\t2885',
 '10399\t8364',
 '10971\t8452',
 '1109111091',
 '11198\t142',
 '1326\t9020',
 '13917118\t7322',
 '1869\t142',
 '2\t2885']
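The .keys() override suggested above could be sketched as below, decoding the internal tab-delimited string keys back into frozensets; the class name is hypothetical and the throwaway database path is for demonstration only.

```python
import os
import shelve
import tempfile

class FrozensetKeysDB(shelve.DbfilenameShelf):
    '''Sketch of the suggested keys() override: decode the internal
    tab-delimited string keys back into frozensets.'''
    def keys(self):
        return [frozenset(k.split("\t"))
                for k in shelve.DbfilenameShelf.keys(self)]

# quick demonstration with a throwaway database file
path = os.path.join(tempfile.mkdtemp(), "pairs.db")
db = FrozensetKeysDB(path, flag="c")
db["10013\t55031"] = "0.72"
print(db.keys())
db.close()
```

Note this does not handle the self-pair keys (the duplicated-identifier strings, with no tab), which would need an extra case.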

And taking one of these keys and turning it into a frozenset, we can then index the database using that as a key.


In [117]:
testkey = test.keys()[0]
testkey = frozenset(testkey.split("\t"))
print testkey
test[testkey]


frozenset(['10013', '55031'])
Out[117]:
'0.72'

In [118]:
test.close()

The problem is that a shelve database can't take a frozenset as a key. The recommended way to deal with this is to write a wrapper. As the database used by shelve is a class, we can build a child class from it, modifying the methods to deal with protein pairs stored as frozenset keys. This code will not deal with arbitrary frozensets as keys.

Essentially, it will use a string of the two protein identifiers separated by a tab as the key. To index, though, it will take a frozenset and convert it into two strings covering both orderings of the pair.


In [110]:
class ProteinPairDB(shelve.DbfilenameShelf):
    '''A simple database for protein pairs using shelve.'''
    def __setitem__(self,key,value):
        #key will be frozenset so make it a list first
        key = list(key)
        #then make it a string
        if len(key) == 1:
            key = key[0]*2
        else:
            key = key[0] + "\t" + key[1]
        shelve.DbfilenameShelf.__setitem__(self,key,value)
        return None
    
    def __getitem__(self,key):
        #make two strings from the key
        key = list(key)
        if len(key) == 1:
            key1 = key[0]*2
            key2 = key[0]*2
        else:
            key1 = key[0] + "\t" + key[1]
            key2 = key[1] + "\t" + key[0]
        #try the first one
        try:
            value = shelve.DbfilenameShelf.__getitem__(self,key1)
        except KeyError:
            #couldn't find the first key, try the second
            value = shelve.DbfilenameShelf.__getitem__(self,key2)
            #if we don't find this one then error out as usual
        return value

We also have to define a function that applies default arguments, mirroring shelve.open:


In [57]:
def openpairshelf(filename, flag='c', protocol=None, writeback=False):
    return ProteinPairDB(filename, flag, protocol, writeback)

The next thing to do is write the code for the feature vector assembler. This is another class, with methods:

  1. Regenerate protein pairs:
    • Sends a signal to all instantiated protein pair parsers to regenerate the pair file if required (if the data source has been updated).
  2. Initialise protein pairs from data source table.
  3. Assemble feature vector files from protein pair files.

And at initialisation it is expected to do:

  1. Loading in the Data source table and instantiating all required protein pair parsers.
  2. Loading in the gold standard protein pairs files, or loading the file names.

To do this it must parse the data source table. It assumes that the data source table is provided as a full path and that this path is the top directory for the data, i.e. all of the data paths in the table are relative to the path of the table itself. The table has the following structure:

Data source directory | Output database directory | Options
/relative/path/to/data | /relative/path/to/output/database | protindexes=1,3;valindex=4;script=/path/to/script;csvdelim=\t

The available options are given in the options column of the table above.

To test initialisation a first draft of the code is given below:


In [147]:
class FeatureVectorAssembler():
    '''Assembles feature vectors from protein pair files, data source lists and gold standard protein pair lists.'''
    def __init__(self,sourcetab,goldpos,goldneg):
        #store the initialisation
        # directories for positive and negative gold standard data files
        self.goldpos = goldpos
        self.goldneg = goldneg
        
        #check if the gold standard data directories exist
        # throw an error if they don't
        
        #instantiate protein pair parsers
        # first parse the data source table
        # store the directory of the table and it's name
        self.sourcetabdir,self.tabfile = os.path.split(sourcetab)
        # open the table and parse for initialisation options
        c = csv.reader(open(sourcetab), delimiter="\t")
        # iterate over lines adding to list of protein pair parsers
        self.parserinitlist = []
        for line in c:
            #store the information in a dictionary
            d = {}
            d["data path"] = line[0]
            d["output path"] = line[1]
            #store options in a dictionary in the dictionary
            d["options"] = {}
            options = line[2].split(";")
            for x in options:
                #split each option to find out which option it is:
                x = x.split("=")
                #store it in the dictionary
                # if there are invalid options this code WILL NOT DETECT THEM
                d["options"][x[0]]= x[1]
            #copy the dictionary into the list
            self.parserinitlist.append(d.copy())
        #then initialise each of these parsers and keep them in a list
        self.parserlist = []
        for parser in self.parserinitlist:
            self.parserlist.append(ProteinPairParser(parser["data path"],
                                                     parser["output path"],
                                                     **parser["options"]))
        return None

Testing initialisation requires a data source table file so this file is created below at the top directory for the data files.


In [139]:
cd /home/gavin/Documents/MRes/


/home/gavin/Documents/MRes

In [141]:
f = open("datasource.tab", "w")
f.write("HIPPIE/prematch.positive.HIPPIE.txt" + "\t" + "HIPPIE/training.positive.HIPPIE.txt" + "\t" + "protindexes=(1,3);valindex=4")
f.close()

Checking this file has written properly:


In [143]:
%%bash
cat datasource.tab


HIPPIE/prematch.positive.HIPPIE.txt	HIPPIE/training.positive.HIPPIE.txt	protindexes=(1,3);valindex=4

Then attempt to initialise the above version from that. Notice here that the FeatureVectorAssembler requires gold standard data sources at initialisation in this version. This is removed in the second version as it made more sense to allow arbitrary protein pair lists in the feature vector assemble method.


In [148]:
test = FeatureVectorAssembler("datasource.tab",
                              "DIP/human/training.nolabel.positive.Entrez.txt",
                              "DIP/human/training.nolabel.negative.Entrez.txt")

It initialises ok, so the second draft of the code with the required methods is given below:


In [174]:
class FeatureVectorAssembler():
    '''Assembles feature vectors from protein pair files, data source lists and gold standard protein pair lists.'''
    def __init__(self,sourcetab):
        #instantiate protein pair parsers
        # first parse the data source table
        # store the directory of the table and it's name
        self.sourcetabdir,self.tabfile = os.path.split(sourcetab)
        # open the table and parse for initialisation options
        c = csv.reader(open(sourcetab), delimiter="\t")
        # iterate over lines adding to list of protein pair parsers
        self.parserinitlist = []
        for line in c:
            #store the information in a dictionary
            d = {}
            d["data path"] = os.path.join(self.sourcetabdir,line[0])
            d["output path"] = os.path.join(self.sourcetabdir,line[1])
            #store options in a dictionary in the dictionary
            d["options"] = {}
            options = line[2].split(";")
            for x in options:
                #split each option to find out which option it is:
                x = x.split("=")
                #store it in the dictionary
                # if there are invalid options this code WILL NOT DETECT THEM
                d["options"][x[0]]= x[1]
            #update the script directory
            if "script" in d["options"].keys():
                d["options"]["script"] = os.path.join(self.sourcetabdir,d["options"]["script"])
            #copy the dictionary into the list
            self.parserinitlist.append(d.copy())
        #then initialise each of these parsers and keep them in a list
        self.parserlist = []
        for parser in self.parserinitlist:
            self.parserlist.append(ProteinPairParser(parser["data path"],
                                                     parser["output path"],
                                                     **parser["options"]))
        return None
    
    def regenerate(self):
        '''Calls all known protein parsers and gets them to regenerate their output, if they have to.'''
        for parser in self.parserlist:
            parser.regenerate()
        return None
    
    def assemble(self, pairfile, outputfile):
        '''Assembles a file of feature vectors for each protein pair in a protein pair file supplied.
        
        Assumes the pairfile is tab delimited.'''
        # first parse the pairfile into a list of frozensets
        pairs = map(lambda l: frozenset(l),csv.reader(open(pairfile), delimiter="\t"))
        # open the file to put the feature vector in
        c = csv.writer(open(outputfile, "w"), delimiter="\t")
        #open all the databases and put them in a dictionary
        dbdict = {}
        for parser in self.parserinitlist:
            dbdict[parser["output path"]] = openpairshelf(parser["output path"])
        
        # then iterate through the pairs, querying all parser databases
        for pair in pairs:
            row = []
            lpair = list(pair)
            row = row + lpair
            for parser in self.parserinitlist:
                row.append(dbdict[parser["output path"]][pair])
            c.writerow(row)
            
        #close all the databases
        for parser in self.parserinitlist:
            dbdict[parser["output path"]].close()
        
        return None

Testing initialisation of this code again:


In [175]:
test = FeatureVectorAssembler("/home/gavin/Documents/MRes/datasource.tab")

Regenerating the data sources. It does not report that any of the data sources require regeneration so nothing is done this time:


In [176]:
test.regenerate()

Then testing the assembly of a feature vector file:


In [177]:
test.assemble("DIP/human/training.nolabel.positive.Entrez.txt", "testoutput")

Looking at this file we can see that it has produced a feature vector with only one feature. It also reports the associated protein pairs next to this feature. This is removed from later versions as it makes later classification more complicated.


In [178]:
%%bash
head testoutput


4084	207	0.86
8360	8356	0.97
5914	9612	0.9
79833	6634	0.62
29102	4090	0
7074	6382	0
7159	22059	0
1869	7029	0.96
801	817	0
207	1786	0.7